Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add new batch overload (bucket factory) #833

Closed
wants to merge 2 commits into from

Conversation

leandromoh
Copy link
Collaborator

@leandromoh leandromoh commented Nov 20, 2021

It is useful to have an overload where user can pass the bucket factory. Pratical example:

  • user can customize to get arrays from ArrayPool
  • user can customize to reuse the array when consume each bucket once, always foward

In both cases are desirable to have the option to custome Batch behaviour, expecially when dealing with large arrays (e.g. +100k elements) where allocate a new array for every new batch is expensive.

Once added this new overload, users are able to create their customizated Batch version in their own applications, like the bellow for example.

public IEnumerable<IEnumerable<T>> BatchFoward<T>(this IEnumerable<T> source, int size)
{
    int lastSize = 0;
    T[] temp = null;

    try
    {
        var result = source.Batch(size, x => x.Take(lastSize), count =>
        {
            lastSize = count;

            return temp ??= ArrayPool<T>.Shared.Rent(count);
        });

        foreach (var batch in result)
            yield return batch;
    }
    finally
    {
        if (temp is { })
            ArrayPool<T>.Shared.Return(temp);
    }
}

Comment on lines +182 to +183
var newArray = bucketFactory(count);
Array.Copy(bucket, 0, newArray, 0, count);
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does the same as Array.Resize(ref bucket, count) but manually to use bucketFactory

@leandromoh leandromoh requested a review from atifaziz November 20, 2021 02:20
Copy link
Member

@atifaziz atifaziz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can certainly understand the argument for reducing memory pressure when working with very large batch sizes and thus arrays, but I think the proposed solution of simply adding an array factory argument to Batch is very tricky to get right for the caller. In the simples of cases, it will lead to incorrect usage and bugs because it requires side effects (like lastSize in your example and test), additional work like clipping each bucket (because the actual array length may be greater than the requested bucket size, especially when renting from a pool) and finally returning each array to the pool at the right time for the reuse to be effective through the iterations. You can see that even in the example you posted in your opening comment, only the last bucket is returned to the pool whereas the rest leak. All this also make the composing with Batch very hard.

I think a sequence of buckets is probably the wrong return type for such an overload because it claims that each bucket is independent of another when that's not true, especially if the backing array is pooled. Each bucket will end up looking at the same array. This can cause even more severe logical bugs.

I've experimented with what a design could look for a version of Batch using pooled arrays and will share it through another PR. Unfortunately, it's not a simple set of changes or tweaks so it's best to share, discuss and review the code in a PR.

@atifaziz
Copy link
Member

atifaziz commented Oct 29, 2022

I've experimented with what a design could look for a version of Batch using pooled arrays and will share it through another PR. Unfortunately, it's not a simple set of changes or tweaks so it's best to share, discuss and review the code in a PR.

@leandromoh See #856 and review welcome.

I think a sequence of buckets is probably the wrong return type

I think I managed to workaround this to still return a sequence.

atifaziz added a commit that referenced this pull request Nov 13, 2022
This is a squashed merge of PR #856 that supersedes #833.

Co-authored-by: Stuart Turner <[email protected]>
@atifaziz
Copy link
Member

This has been superseded by #856.

@leandromoh Happy to get your review in future follow-up issue or PR when you get the time.

@atifaziz atifaziz closed this Nov 13, 2022
@leandromoh leandromoh deleted the feature/bucket-factory branch November 16, 2022 13:28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants